reesehs_OriginalHomeworkCode_03
library(curl)
## Using libcurl 8.1.2 with LibreSSL/3.3.6
f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/zombies.csv") #imports data about zombies
d <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE) #reads data and creates data frame from it
head(d) #returns the first six rows of the data frame
##   id first_name last_name gender   height   weight zombies_killed
## 1  1      Sarah    Little Female 62.88951 132.0872              2
## 2  2       Mark    Duncan   Male 67.80277 146.3753              5
## 3  3    Brandon     Perez   Male 72.12908 152.9370              1
## 4  4      Roger   Coleman   Male 66.78484 129.7418              5
## 5  5      Tammy    Powell Female 64.71832 132.4265              4
## 6  6    Anthony     Green   Male 71.24326 152.5246              1
##   years_of_education                           major      age
## 1                  1                medicine/nursing 17.64275
## 2                  3 criminal justice administration 22.58951
## 3                  1                       education 21.91276
## 4                  6                  energy studies 18.19058
## 5                  3                       logistics 21.10399
## 6                  4                  energy studies 21.48355
Question One: Calculate the population mean and standard deviation for each quantitative random variable (height, weight, age, number of zombies killed, and years of education). NOTE: You will not want to use the built in var() and sd() commands as these are for samples.
#height
height <- (d$height) #takes data from the column height in the data frame d
mh <- mean(height) #takes the mean of the height data
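sqrt(sum((height - mean(height))^2)/length(height)) #calculates the population standard deviation of height data
## [1] 4.30797
#weight
weight <- (d$weight)
mw <- mean(weight)
sqrt(sum((weight - mean(weight))^2)/length(weight))
## [1] 18.39186
#age
age <- (d$age)
ma <- mean(age)
sqrt(sum((age - mean(age))^2)/length(age))
## [1] 2.963583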
#number of zombies killed
zk <- (d$zombies_killed)
mzk <- mean(zk)
sqrt(sum((zk - mean(zk))^2)/length(zk))
## [1] 1.747551
#years of education
yoedu <- (d$years_of_education)
myoedu <- mean(yoedu)
sqrt(sum((yoedu - mean(yoedu))^2)/length(yoedu))
## [1] 1.675704
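The same population formula is repeated for each variable above. A minimal sketch of a helper that wraps it follows; `pop_sd` is a hypothetical name (not a base R function), and the toy vector is made-up illustration data rather than the zombie data.

```r
# pop_sd is a hypothetical helper (not part of base R): population standard
# deviation, dividing by N instead of the N - 1 used by sd()
pop_sd <- function(x) {
  sqrt(sum((x - mean(x))^2) / length(x))
}

# made-up toy vector, only to show the relationship with sd()
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
pop_sd(x)                                   # population SD
sd(x) * sqrt((length(x) - 1) / length(x))   # same value, rescaled from the sample SD
```

With the data frame from above, something like `sapply(d[, c("height", "weight", "age", "zombies_killed", "years_of_education")], pop_sd)` should return all five population standard deviations in one call.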
Question 2: Use {ggplot} to make boxplots of each of these variables by gender
library(ggplot2)
p <- ggplot(data = d, aes(x = gender, y = height)) #create graph with data from gender in x axis and data from height in y axis, all data is taken from the data frame d
p <- p + geom_boxplot() #create boxplot
p <- p + theme(axis.text.x = element_text(angle = 90)) #shift angle of x axis
p <- p + ylab("Height") #labels y axis
p
w <- ggplot(data = d, aes(x = gender, y = weight))
w <- w + geom_boxplot()
w <- w + theme(axis.text.x = element_text(angle = 90))
w <- w + ylab("Weight")
w
a <- ggplot(data = d, aes(x = gender, y = age))
a <- a + geom_boxplot()
a <- a + theme(axis.text.x = element_text(angle = 90))
a <- a + ylab("Age")
a
zn <- ggplot(data = d, aes(x = gender, y = zk))
zn <- zn + geom_boxplot()
zn <- zn + theme(axis.text.x = element_text(angle = 90))
zn <- zn + ylab("Number of Zombies Killed")
zn
ey <- ggplot(data = d, aes(x = gender, y = yoedu))
ey <- ey + geom_boxplot()
ey <- ey + theme(axis.text.x = element_text(angle = 90))
ey <- ey + ylab("Years of Education")
ey
Question 4: Using histograms and Q-Q plots, check whether the quantitative variables seem to be drawn from a normal distribution. Which seem to be and which do not (hint: not all are drawn from the normal distribution)? For those that are not normal, can you determine from which common distribution they are drawn?
qqnorm(height, main = "Normal QQ plot random normal variables") #creates qqnorm graph to test normal distribution
qqline(height, col = "gray") #adds the reference line the points would follow under a normal distribution
Question 5: Now use the sample() function to sample ONE subset of 30 zombie survivors (without replacement) from this population and calculate the mean and sample standard deviation for each variable. Also estimate the standard error for each variable, and construct the 95% confidence interval for each mean. Note that for the variables that are not drawn from the normal distribution, you may need to base your estimate of the CIs on slightly different code than for the normal…
#sample(d, 30, replace = FALSE)
#this doesn't work: sample() treats the data frame d as a list of its 10 columns, so it cannot draw 30 items without replacement
#height
hs <- sample(height, 30, replace = FALSE) #takes a random sample of 30 from the height dataset
mhs <- mean(hs) #takes mean of the 30 samples
sdhs <- sd(hs) # I'm honestly a little confused on the difference between sample standard deviation and standard error?
hsse <- sdhs/sqrt(30) #takes standard error of the 30 samples
lower <- mh - qnorm(1 - 0.05/2) * hsse
upper <- mh + qnorm(1 - 0.05/2) * hsse
ci <- c(lower, upper) #calculates the confidence interval
ci
## [1] 66.19041 69.06979
#weight
ws <- sample(weight, 30, replace = FALSE)
ws
## [1] 147.5056 128.0654 123.6400 169.0869 156.2692 147.5562 184.3939 155.7613
## [9] 129.9026 163.0758 116.9201 182.6067 157.0047 138.1399 151.6608 136.0072
## [17] 144.9560 164.8547 162.3514 143.8121 126.0938 132.4048 141.5002 172.5778
## [25] 120.1112 132.5271 148.8366 147.7249 133.5535 149.3586
mws <- mean(ws)
sd(ws)
## [1] 17.64022
sews <- sd(ws)/sqrt(30)
sews
## [1] 3.220649
lower <- mw - qnorm(1 - 0.05/2) * sews # alpha/2 in each of the upper and lower tails of the distribution
upper <- mw + qnorm(1 - 0.05/2) * sews # alpha/2 in each of the upper and lower tails of the distribution
ci <- c(lower, upper)
ci
## [1] 137.5951 150.2198
#age
as <- sample(age, 30, replace = FALSE)
as
## [1] 19.73937 22.01716 14.21540 16.71759 23.88692 15.55708 19.19267 21.66693
## [9] 18.42327 17.09370 19.84517 18.03707 14.54944 25.35665 16.78826 17.89542
## [17] 18.74854 19.97787 21.13508 18.53755 18.88307 16.16114 18.82747 21.48753
## [25] 19.16099 19.92937 18.63669 18.91888 16.50589 15.41400
mas <- mean(as)
sd(as)
## [1] 2.568164
seas <- sd(as)/sqrt(30)
seas
## [1] 0.4688804
lower <- ma - qnorm(1 - 0.05/2) * seas # alpha/2 in each of the upper and lower tails of the distribution
upper <- ma + qnorm(1 - 0.05/2) * seas # alpha/2 in each of the upper and lower tails of the distribution
ci <- c(lower, upper)
ci
## [1] 19.12797 20.96595
#zombies killed and years of education were not normally distributed, so I relied on the central limit theorem here: the mean of a sample of 30 should still be approximately normally distributed. Honestly I don't know if that is the right approach, so any help would be appreciated! Or is it supposed to be a t distribution?
#zombies killed
zks <- sample(zk, 30, replace = FALSE)
zks
## [1] 2 1 1 3 5 0 5 5 1 1 4 5 2 3 5 3 5 2 3 3 3 2 4 5 2 2 3 3 3 3
mzks <- mean(zks)
sd(zks)
## [1] 1.449931
sezks <- sd(zks)/sqrt(30)
sezks
## [1] 0.2647199
lower <- mzk - qnorm(1 - 0.05/2) * sezks # alpha/2 in each of the upper and lower tails of the distribution
upper <- mzk + qnorm(1 - 0.05/2) * sezks # alpha/2 in each of the upper and lower tails of the distribution
ci <- c(lower, upper)
ci
## [1] 2.473159 3.510841
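#years of education
edus <- sample(yoedu, 30, replace = FALSE)
edus
## [1] 4 1 4 4 6 1 5 3 2 3 3 6 3 5 0 3 2 2 4 4 3 2 4 0 6 5 3 3 3 3
medus <- mean(edus)
sd(edus)
## [1] 1.590561
seedu <- sd(edus)/sqrt(30)
seedu
## [1] 0.2903954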
lower <- myoedu - qnorm(1 - 0.05/2) * seedu # alpha/2 in each of the upper and lower tails of the distribution
upper <- myoedu + qnorm(1 - 0.05/2) * seedu # alpha/2 in each of the upper and lower tails of the distribution
ci <- c(lower, upper)
ci
## [1] 2.426835 3.565165
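On the question raised in the comments above: the sample standard deviation describes how spread out the 30 individual observations are, while the standard error estimates how much the sample mean itself would vary from one sample of 30 to the next. Note also that the intervals above are centered on the population means (`mh`, `mw`, and so on), whereas an interval built from a sample is usually centered on the sample mean. A minimal sketch under those assumptions follows. It uses simulated data (the `rnorm()` population and the seed are stand-ins, not the zombie data) and shows `qt()` next to `qnorm()`, since the t distribution is one common choice for a sample of 30.

```r
set.seed(1)                                   # arbitrary seed, for reproducibility only
pop <- rnorm(1000, mean = 67.6, sd = 4.3)     # simulated stand-in population (assumption)
n <- 30
s <- sample(pop, n, replace = FALSE)          # one sample of 30

m_s  <- mean(s)            # sample mean
sd_s <- sd(s)              # sample SD: spread of the individual observations
se_s <- sd_s / sqrt(n)     # standard error: estimated spread of the sample mean

# 95% CI centered on the sample mean, using the t distribution with n - 1 df
ci_t <- m_s + c(-1, 1) * qt(1 - 0.05 / 2, df = n - 1) * se_s
# the same interval built with the normal quantile is slightly narrower
ci_z <- m_s + c(-1, 1) * qnorm(1 - 0.05 / 2) * se_s
ci_t
ci_z
```

For the count variables (zombies killed, years of education), the central limit theorem argument is what justifies using either interval on the sample mean; a bootstrap interval would be another option.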
Question 6: Now draw 99 more random samples of 30 zombie apocalypse survivors, and calculate the mean for each variable for each of these samples. Together with the first sample you drew, you now have a set of 100 means for each variable (each based on 30 observations), which constitutes a sampling distribution for each variable. What are the means and standard deviations of this distribution of means for each variable?
#height
hs2 <- NULL # sets up a dummy variable
n <- 30 #size of each sample
for (i in 1:99) {
hs2[[i]] <- mean(sample(height, n, replace = FALSE))
} #draws 99 samples with 30 random numbers from height data set and takes the mean of each sample set
hs2 <- append(hs2, mhs, after = 99) #adds the mean from question 5 to the list
hs2 <- unlist(hs2) #unlist the variable
mean(hs2) # takes the mean of the 100 sample means
## [1] 67.5907
sd(hs2) # takes the standard deviation of the 100 sample means
## [1] 0.7380481
#weight
ws2 <- NULL
for (i in 1:99) {
ws2[[i]] <- mean(sample(weight, n, replace = FALSE))
}
ws2 <- append(ws2, mws, after = 99)
ws2 <- unlist(ws2)
mean(ws2)
## [1] 144.6373
sd(ws2)
## [1] 3.192088
#age
as2 <- NULL
for (i in 1:99) {
as2[[i]] <- mean(sample(age, n, replace = FALSE))
}
as2 <- append(as2, mas, after = 99)
as2 <- unlist(as2)
mean(as2)
## [1] 20.03221
sd(as2)
## [1] 0.6214387
#zombies killed
zks2 <- NULL
for (i in 1:99) {
zks2[[i]] <- mean(sample(zk, n, replace = FALSE))
}
zks2 <- append(zks2, mzks, after = 99) #adds the mean from the sample in question 5
zks2 <- unlist(zks2)
mean(zks2)
## [1] 2.976667
sd(zks2)
## [1] 0.3188257
#years of education
yedus2 <- NULL
for (i in 1:99) {
yedus2[[i]] <- mean(sample(yoedu, n, replace = FALSE))
}
yedus2 <- append(yedus2, medus, after = 99)
yedus2 <- unlist(yedus2)
mean(yedus2)
## [1] 2.993667
sd(yedus2)
## [1] 0.2706228
How do the standard deviations of means compare to the standard errors estimated in [5]?
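The SDs of the means are close to the standard errors from Question 5, but they are not always higher: they are higher for height, age (0.621 vs. 0.469), and zombies killed (0.319 vs. 0.265), and slightly lower for weight (3.192 vs. 3.221) and years of education (0.271 vs. 0.290).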
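The loops above can also be written with `replicate()`. The sketch below uses simulated data (the `rpois()` population and the seed are stand-ins, not the zombie data). It draws 100 sample means and compares their standard deviation with the population SD divided by sqrt(n), which is the quantity the standard error estimates.

```r
set.seed(42)                     # arbitrary seed, for reproducibility only
pop <- rpois(1000, lambda = 3)   # simulated non-normal stand-in population (assumption)
n <- 30

# 100 means, each from a fresh sample of 30 drawn without replacement
means <- replicate(100, mean(sample(pop, n, replace = FALSE)))

mean(means)    # should be close to mean(pop)
sd(means)      # SD of the sampling distribution of the mean

pop_sd <- sqrt(sum((pop - mean(pop))^2) / length(pop))
pop_sd / sqrt(n)   # approximate theoretical standard error for n = 30
```

Because each standard error in Question 5 comes from a single sample of 30, it can land a little above or below the SD of the 100 means, which would be consistent with the mixed pattern seen above.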
What do these sampling distributions look like (a graph might help here)? Are they normally distributed? What about for those variables that you concluded were not originally drawn from a normal distribution?
#height
hist(hs2, probability = TRUE) #create histogram from sample distribution calculated in chunk above
qqnorm(hs2, main = "Normal QQ plot random normal variables") #plots qqnorm graph based on hs2 data
qqline(hs2, col = "gray") #plot qqline
Most of the sampling distributions look approximately normal, with a little variation, as in the number of zombies killed. Compared with the original plots of the population, the Q-Q plots follow the line more closely, especially for years of education and number of zombies killed, which were not normal originally.
Challenges Faced
The formula for the population standard deviation was hard for me to find and to figure out how to use, but once I found it in the modules it was pretty straightforward.
Initially both of my ggplot scatter plots would not run and I could not figure out what was going wrong. After 30 minutes of reinstalling RStudio and restarting my computer, I figured out that I was simply missing a ‘)’.
The difference between standard error and standard deviation is something I still don’t understand, but I think my code is calculating the right numbers.
I could not figure out how to take a sample from my data set d because its length was only 10, so I eventually came to the conclusion that I had to take samples from each of the variable columns I would be looking at.
I worked on trying to get 99 samples of size 30 for a long time. In the end, after staring at my code for many hours, I took samples of each variable in a for loop.
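On the data frame sampling problem described above: `sample()` treats a data frame as a list of columns, which is why asking for 30 items from `d` fails. A minimal sketch of sampling rows instead, by sampling row indices, is shown below; the small data frame `toy` and its column names are made up for illustration.

```r
set.seed(7)                       # arbitrary seed, for reproducibility only
toy <- data.frame(id = 1:100,     # made-up stand-in for the zombie data
                  height = rnorm(100, mean = 67, sd = 4),
                  kills = rpois(100, lambda = 3))

rows <- sample(nrow(toy), 30, replace = FALSE)   # 30 distinct row numbers
s <- toy[rows, ]                                 # all columns for those rows

dim(s)                                           # 30 rows, 3 columns
colMeans(s[, c("height", "kills")])              # sample mean of each variable
```

Sampling rows this way keeps each survivor's values together, so every variable is summarized over the same 30 survivors.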